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Abstract 

The goal of ranking and selection (R&S) procedures is to identify the best stochastic sys¬ 
tem from among a finite set of competing alternatives. Such procedures require constructing 
estimates of each system’s performance, which can be obtained simultaneously by running mul¬ 
tiple independent replications on a parallel computing platform. However, nontrivial statistical 
and implementation issues arise when designing R&S procedures for a parallel computing envi¬ 
ronment. Thus we propose several design principles for parallel R&S procedures that preserve 
statistical validity and maximize core utilization, especially when large numbers of alternatives 
or cores are involved. These principles are followed closely by our parallel Good Selection 
Procedure (GSP), which, under the assumption of normally distributed output, (i) guarantees 
to select a system in the indifference zone with high probability, (ii) runs efficiently on up to 
1,024 parallel cores, and (iii) in an example uses smaller sample sizes compared to existing 
parallel procedures, particularly for large problems (over 10 6 alternatives). In our computa¬ 
tional study we discuss two methods for implementing GSP on parallel computers, namely the 
Message-Passing Interface (MPI) and Hadoop MapReduce and show that the latter provides 
good protection against core failures at the expense of a significant drop in utilization due to 
periodic unavoidable synchronization. 


1 Introduction 

The simulation optimization (SO) problem is a nonlinear optimization problem in which the ob¬ 
jective function is defined implicitly through a Monte Carlo simulation, and thus can only be 
observed with error. Such problems are common in a variety of applications including transporta¬ 
tion, public health, and supply chain management; for these and other examples, see SimOpt. org 
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(Hen derson and P as u d at Ini 120141') . For over vi ews of methods to solv e the S O problem, see, e.g., 


(1994), Andradottirl (1998), 


Fu 


Fn et al. 


(2005), Pasupathv and Ghosh (2013). 


We consider the case of SO on finite sets, in which the decision variables can be categorical, 
integer-ordered and finite, or a finite “grid” constructed from a continuous space. Formally, the SO 
problem on finite sets can be written as 


max ^ = E[X(i;^)] 
i£S 


(1) 


where S = {1,..., k} is a finite set of design points or “systems” indexed by i, and £ is a random 
element used to model the stochastic nature of simulation experiments. (In the remainder of the 
paper we assume that Hi < H 2 < ■ ■ • < Hki so that system k is the best.) The objective function 
H : S —>• M cannot be computed exactly, but can be estimated using output from a stochastic 
simulation represented by X(-;£). While the feasible space S may have topology, as in the finite 
but integer-ordered case, we consider only methods to solve the SO problem in (pQ) that (i) do 
not exploit such topology or structural properties of the function, and that (ii) apply when the 
computational budget permits at least some simulation of every system. Such methods are called 
ranking and selection (R&S) procedures. 

R&S procedures are frequently used in simulation studies because structural properties, such as 
convexity, are difficult to verify for simulation models and rar ely hold. They can also be used in con 


junction with heuristic search procedures in a variety of ways (|Pichitlamken et al 


2003). making them useful even if not all systems can be simulated. See IKim and Nelso n (l2006al ) 


2006, 


Boesel et a’ 


for an excellent introduction to, and overview of, R&S procedures. We remark here that while R&S 
problems are closely related to best-arm problems, there are several differences between these bodies 
of literature. Almost always, the algorithms developed in the best-arm l iterature assume that only 
one system is simulated at a time (see, e.g., 


Jamieson and Nowak 


2014, 


Bubeck and Cesa-Bianchi 


2012) and that simulation outputs are bounded, or that all variances have a known bound. 

R&S procedures are designed to offer one of several types of probabilistic guarantees, and 
can be Bayesian or frequentist in nature. Bayesian procedures offer gua rante es related to a loss 
function associated with a non-optimal choice; see 


Branke et al. 


(120071 1 and 


Chen et al. 


(2015!, 


Chapter 3). Frequentist procedures typically offer one of two statistical guarantees; in defining 
these guarantees, let 6 > 0 be a known constant and let a € (0,1) be a parameter selected by 
the user. The Probability of Correct Selection (PCS) guarantee is a guarantee that, whenever 
Hk — Hk-i > <5, the probability of selecting the best system k when the procedure terminates 
is greater than 1 — a. Henceforth, the assumption that Hk ~ Hk—i — $ will be called the PCS 
assumption ; if Hk ~ Hk -1 < <5 then a PCS guarantee does not hold. In contrast, the Probability 
of Good Selection (PGS) guarantee is a guarantee that the probability of selecting a system with 
objective value within 5 of the best is greater than 1 — a. That is, the PGS guarantee implies PGS 
= P [Select a system K such that Hk ~ Hk < S\ > 1 — a. A PGS guarantee makes no assumption 
about the configuration of the means and is the same as the “probably approximately correct” 
guarantee in best-arm literature. 
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Traditionally, R&S procedures were limited to problems with a modest number of systems k, 
say k < 100, due to the need to assume worst-case mean configurations to construct validity p roo fs. 


The advent of screening, i.e., disca rding clearly inferior alternatives early on (|Nelson et al 


2001 


Kim and Nelson 


2006b, Hong 200 6). has allowed R&S to be applied to larger problems, say k < 500. 


Exploiting parallel computing is a natural next step as argued in, e.g., |Fu (2002). By employing 
parallel cores, simulation output can be generated at a higher rate, and a parallel R&S procedure 
should complete in a smaller amount of time than its sequential equivalent, and allowing larger 
problems to be s olved. 


Heidelbergeri (1198811 , iGlvnn and Heidelbergeri (119901 . 1 199 ill explored the use of parallel comput¬ 


ers to construct valid simulation estima tors, but R&S proc edures that exploit parallel computing 


have emerged only recently. 


Luo et al 


(2000) and 


Yoo et al. 


( 2009! ) employ a web-based comput¬ 


ing environment and present a parallel procedure under the optimal computing budget allocation 
(OCBA) framework. (OCBA has impressive empirical performance, but does not offer PCS or PGS 
guarantees.) C he n (20051 ) test s a se quential pairwise hypothesis testing approach on a local network 
of computers. More recently, Luo et al. (2013|) develop a parallel ada ptati on of^a fully-sequential 
R&S procedure that provides an asymptotic (as <5 —>• 0) PCS guarantee. 


Luo et al 


(2013j) is the best 

known existing method for parallel ranking and selection that provides a form of PCS guarantee 
on the returned solution. 

In this paper, we (i) identify opportunities and challenges that arise from adopting a parallel 
computing environment to solve large-scale R&S problems, (ii) propose a Good Selection Procedure 
(GSP) that solves R&S problems on parallel computers, and (iii) implement our procedure in two 
different parallel computing frameworks. We make the following contributions. 

Theoretical contributions. We propose a number of design principles that promote efficiency 
and validity in such an environment, and demonstrate them in a new parallel GSP. GSP showcases 
the power of these design principles in that it gre atly extends the boundary on the size of solvable 
R&S problems. While the method of (Luo e t al . 2013) can solve on the order of 10 4 systems, one 
of our implementations of GSP is capable of solving R&S problems with more than 10 6 systems. 
Our computational results include such a problem, which we solve in under 6 minutes on 10 3 cores. 
Another important theoretical contribution of this paper ja the rede signed screening method in 


GSP which, unlike many fully-sequential procedures ([Kim and Nelson 1 


2001 


Horml 120061 ). does not 


rely on the PCS assumption. Accordingly, many systems can lie within the indifference-zone, i.e., 
have an objective function value within 5 of that of System k , as will usually be the case when the 
number of systems is very large. Our procedure then provides the same PGS guarantee as existing 


indifference-zone procedures like 

Practical contributions. 


Nelson et al. 


(2001) but with far smaller sample sizes. 


The GSP procedure discussed in this paper is intended for any 
parallel, shared or non-shared memory platform where cores can communicate with each other. As 
long as no core fails during execution, it should deliver expected results regardless of the hardware 
specification. The procedure is also amenable to a range of existing parallel computing frameworks. 
We offer implementations of GSP based on MPI (Message-Passing Interface) and Hadoop MapRe- 
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duce, and show how they differ in construction and in performance. The reasons for our choice of 
implementation frameworks are twofold: 

• Both MPI and MapReduce are among the most popular and mature platforms for deploying 
parallel code, on a wide range of systems ranging from high performance supercomputers to 
commodity clusters such as Amazon EC2. 


• MPI and MapReduce provide points of comparison between two different parallel design 
philosophies. Broadly speaking, the former enables low level tailoring and optimization in 
the implementation of a parallel procedure, while the latter is more of a “one-size-fits-all” 
framework which delegates as much of the implementation complexity as possible to the 
MapReduce package itself. 


As we shall see, MPI is the more efficient of the two, achieving speed and utilization gains of 
around a factor of magnitude over MapReduce. On the other hand, MapReduce offers acceptable 
performance for large scale problems, and is more robust to reliability issues that may arise in 
cloud-computing environments where parallel tasks may fail to complete due to unresponsive cores. 

The remainder of the paper is organized as follows, ^discusses the design principles followed 
in creating GSP to promote efficiency and ensure the procedure’s validity. tj3] describes our multi¬ 
stage parallel R&S procedure GSP, and establishes the PGS guarantee. Computational studies in 
1|4] support our assertions on the quality of GSP and its parallel implementations, and point to 
open-access repositories where the code can be obtained. An appendix contains more proof detail, 
and further information on the MPI a nd Map Reduce implementations. This paper is a considerable 


outgrowth of the conference papers 


Ni et al. 


(120131 . l2014l201R ). 


2 Design Principles for Parallel R&S Procedures 

R&S procedures are essentially made up of three computational tasks: (1) deciding what simula¬ 
tions to run next, (2) running simulations, and (3) screening (computing statistical estimators and 
determining which systems are inferior). On a single-core computer, these tasks are repeatedly 
performed in a certain order until a termination criterion is met. On a parallel platform, multiple 
cores can simultaneously perform one or several of these tasks. 

In this section, we discuss various issues that arise when a R&S procedure is designed for and 
implemented on parallel platforms to solve large-scale R&S problems. We argue that failing to 
consider these issues may result in unpractically expensive or invalid procedures. We recommend 
strategies by which these issues can be addressed, and illustrate how we incorporate them in our 
procedure presented in which iteratively runs Tasks (1) through (3) in multiple stages. 

For discussing the design principles for parallel R&S procedures in this section, we consider a 
parallel computing environment that satisfies the following properties. 

Assumption 1. (Core Independence) A fixed number of processing units (“cores”) are employed 
to execute the parallel procedure. Each core is capable of performing its own set of computations 


4 









without interfering with other cores unless instructed to do so. Each core has its own memory and 
does not access the memory of other cores. 

Assumption 2. (Message-passing) The cores are capable of communicating through sending and 
receiving messages of common data types and arbitrary lengths. 

Assumption 3. (Reliability) Cores do not “fail” or suddenly become unavailable. Messages are 
never “lost”. 


Many parallel computer platforms satisfy the first two assumptions, but some are subject to 
the risk of core failure, which may interrupt the computation in various ways. For clarity, we work 


we discuss Hadoop MapReduce. Similar to 


Luo et al. 

(2013 

) and 

Ni et al. 

(2013) 


we consider a 


master-worker framework, using a uniquely executed “master” process (typically run on a dedicated 
“master” core) to coordinate the parallel procedure, and letting other cores (the “workers”) work 
according to the master’s instructions. To the extent possible we want to avoid synchronization 
delays, where one core cannot continue until another core completes its task, as we will see in 


2.1 Implications of Random Completion Times 


Consider the simplest case where only Task (2), running simulations, is run in parallel, and each 
simulation replication completes in a random amount of time. To construct estimators for a single 
system simulated by multiple cores, one can either collect a fixed number of replications in a random 
comp le tion time, or a ran dom number of replications in a fixed co mpletion time (jHeidelberger 


19881 1 . iHeidelbergert ()1988l l and iGlvnn and Heidelbergeri (jl990l . Il99lh discuss unbiased estimators 


of each type. Because a random number of replications collected after a fixed amount of time 
may not be i.i.d. with the desired distribution upon which much of the screening theory depends 


(jHei delb erger 


1988 


Glynn and Heidelberger 


1991 


Ni et al 


2013, 


Luo et al. 


2013), we confine our 


attention to estimators that produce a fixed number of replications in a random completion time. 
(The cause of this difficulty can be traced to dependence between the estimated objective function 
and computational time.) 

Using estimators that produce a fixed number of replications in a random completion time for 
parallel R&S places a restriction on the manner in which replications can validly be farmed out to 
and collected from the workers. Consider the case where more than one core simulates the same 
system, and replications generated in parallel are aggregated to produce a single estimator. A naive 
way is to collect replications from any core following the order in which they are generated, but as 
demonstrated in Ni et al. 


(2.013lj §3.1), the estimators may be biased, making it hard to establish 
provable statistical guarantees. In contrast, a valid method is to place the finished replications 
in a predetermined order and use them as if they are generated following that order, to avoid 
“re-ordering” of the simulation replications caused by random completion time. 

Under this principle, our GSP in 1J3] ensures that the simulation results generated in parallel are 
initiated, collected, assembled and used by the screening routine in an ordered manner. Specifically, 
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in Stage 2 of GSP, when the master instructs a worker to simulate system i for a batch of replications 
(Step Hc)[ ), the batch index is also received by the worker. When the batch is completed, its 
statistics are sent back to the master alongside the batch index (Step E |Kc)[ ), which signals its 
pre-determined position in the assembled batch sequence on the master. This ensures that the 


batch statistics sent to workers for screening (StepEKd) ) follow the exact order in which they were 
initiated, and constructed estimators are unbiased with the correct distribution. Lu o et aJj (12013) 
discuss a similar approach which they refer to as “vector-filling”. 


2.2 Allocating Tasks to the Master and Workers 


Previous work o n parallel R&S procedures (IChen et al.l l2000. 


Luo et al 


Yoo et al 


2009 


Luo and Hong 


2011 


20131 ) focuses almost exclusively on pushing Task (2), running simulations, to parallel 


cores. In those procedures, usually the master is solely responsible for Tasks (1) and (3), deciding 
what simulations to run next and screening, and the workers perform Task (2) in parallel. In this 
setting, the benefit of using a parallel computing platform is entirely attributed to distributing 
simulation across parallel cores, hence reducing the total amount of time required by Task (2). 

However^the master could potentially become a bottleneck in a number of ways. First, as 
noted by Luo and Hong (2011), the master can be overwhelmed with messages. Second, for the 
master to keep track of all simulation res ults requires a large amount of memory, especially when 
the number of systems is large (Luo et ah 2013j). Finally, when the number of systems is large and 
simulation output is generated by many workers concurrently, running Tasks (1) and (3) on the 
master alone may become relatively slow, resulting in a waste of core hours on workers waiting 
for the master’s further instructions. Therefore, a truly scalable parallel R&S procedure should 
allow its users a simple way to control the level of communication, use the memory efficiently, and 
distribute as many tasks as possible across parallel cores. In addition, it should perform some form 
of load-balancing to minimize idling on workers. 


2.2.1 Batching to Reduce Communication Load 

One way to reduce the number of messages handled by the master is to control communication 
frequency by having the workers run simulation replications in batches and only communicate once 
after each batch is finished. 

Since R&S procedures typically use summary statistics rather than individual observations when 
screening systems, it may even suffice for the worker to compute and report batch statistics instead 
of point observations from every single replication. Indeed, a useful property of our statistic for 
screening systems i and j is that it is updated using only the sample means over the entirety of the 
most recent batch r, instead of requiring the collection of individual replication outcomes. These 
sample means can be independently computed on the worker(s) running the rth batch of systems i 
and j. and the amount of communication needed in reporting them to the master is constant and 
does not grow with the batch size. 
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The distribution of batches in parallel must be handled with care. Most importantly, since 
using a random number of replications after a fixed run time may introduce bias (as we have 
shown in H2.ip . a valid procedure should employ a predetermined and fixed batch size for each 
system, which may vary across different systems. Batches generated in parallel for the same system 
should be assembled according to a predetermined order, following the same argument used in 
m\ Furthermore, if the procedure requires screening upon completion of every batch, then it is 
necessary to perform screening steps following the assembled order. 


2.2.2 Allocating Simulation Time to Systems 


When multiple systems survive a round of screening, R&S procedures need to decide which sys¬ 
tem^) to simulate next (possibly on multiple cores), and how many replications to take. While 
sequential procedures usually sample one replication from the chosen system(s), or multiple repli¬ 
cations from a single system, it is natural for a parallel procedure to consider strategies that sample 
multiple replications from multiple systems. In doing so, the parallel procedure may adopt sam¬ 
pling strategies such that simulation resources are allocated to surviving systems in a most efficient 
manner. 

The bes t prac tice in making such allocations depends on the specific screening method. For 


instance, in IHond (2006!) as well as GSP, screening between systems i and j is based on a scaled 


Brownian motion B([af/rii + of/rij] -1 ) where £>(•) denotes a standard Brownian motion (with 
zero drift and unit volatility), n* is the sample size and of is the variance of system i. To dri ve 
this Brownian motion rapidly with the fewest samples possible, which accelerates screening, .Hong 
(|2006 ) recommended that the ratio rii/oi be kept equal across all surviving systems. 

The above recommendation implicitly assumes that simulation completion time is fixed for all 
systems, and is suboptimal when completion time varies across systems. Suppose all workers are 
identical, and each replication of system i takes a fixed amount of time X) to simulate on any 
worker. We can then formulate the problem of advancing the above Brownian motion as 


max [of/rii + a?/rij\ 1 
s.t. riiTi + rijTj = T 


which yields the optimal computing time allocation 


njTj _ Ojy/Ti 

n jTj 


This result is consistent with a conclusion in 


Glvnn and Whittl (|1992l ). that when simulation com¬ 


pletion time Ti varies, an asymptotic measure of efficiency per replication is inversely proportional 


to ofE[Ti\. 

In practice, X) is unknown and possibly random, so both E[Tf\ and <r 2 need to be estimated in a 
preliminary stage. Suppose they are estimated by some estimators X) and Sf. Then we recommend 
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Figure 1: Comparison of screening methods applied on 50 systems. Each black or green dot 
represents a pair of systems to be screened. In the left panel, all pairs of screening is done on 
the master. In the right panel, each worker core gets 10 systems, screens between themselves, 
and screens its systems against one system from every other worker that has the highest sample 
mean. 


setting the batch size for each system i proportional to £) / \J~Ti following ([2]) . 


2.2.3 Distributed screening 


In fully sequential R&S procedures, e.g., Kim and Nelson (2006b), IHongl (2006), each screening 
step typically involves doing a fixed amount of calculation between every pair of systems to decide 
if one system is better than another with a certain degree of statistical confidence. The amount of 
work is proportional to the number of pairs of systems, which is 0(k 2 ). 

In the serial R&S literature, the computational cost of screening is assumed to be negligible 
compared to that of simulation because the number of systems k is usually quite small and each 
simulation replication may take orders of magnitude longer than 0(k 2 ) screening operations re¬ 
quired in each iteration. Under this assumption, it is tempting to simply have the master handle 
all screening after the workers complete a simulation batch. This approach can easily be imple¬ 
mented and proven to be statistically valid. However, it may become computationally inefficient 
because all workers stay idle while the master screens, so a total amount of 0(ck 2 ) processing time 
is wasted, where c is the number of workers. For a large problem with a million systems solved on a 
thousand cores, the wasted processing time per round of screening can easily amount to thousands 
of core hours, reducing the benefits from a parallel implementation dramatically. Moreover, if the 
procedure requires computing and storing in memory some quantities for each system pair (for 
instance, the variance of differences between systems), then the total amount of 0(k 2 ) memory 
may easily exceed the limit for a single core. 

It is therefore worth considering strategies that distribute screening among workers. A natural 
strategy is to assign roughly k/c systems to each worker, and let it screen among those systems 


































































































only, as illustrated in Figure [T] By doing so, each worker screens k/c systems, occupying only 
0(k 2 /c 2 ) memory, and performing 0(k 2 /c 2 ) work in parallel. Hence the wall-clock time for each 
round of screening is reduced by a factor of c 2 . 

Under the distributed screening scheme, not all pairs of systems are compared, so fewer systems 
may get eliminated. The reduction in effectiveness of screening can be compensated by sharing 
some good systems across workers. In Figure [TJ for example, each core shares its own (estimated) 
best system with other cores, and each system is screened against other systems on the same core, 
as well as 0(c) good systems from other cores. This greatly improves the chance that each system 
is screened against a good one, despite the extra work to share those good systems. As illustrated 
in Figurc[l] the additional number of pairs that need to be screened on each core is only 0(k) when 
the best system on each core is shared. Alternatively, the procedure may also choose to share only 
a smaller number </ Ccof good systems, so that the communication workload associated with this 
sharing does not increase as the number of workers increases. 

The sta tistical validity of some screening-based R&S procedures 


Hong 2006, 


Luo et al. 


(e.g. 


Kim and Nelson 


2001 


2013) requires screening to be performed once every replication (or batch 


of replications) is simulated. This implies that, when the identity of the estimated-best system(s) 
changes, the master has to communicate all previous replication results of the new estimated-best 
system(s) to the workers, so that they can perform all of the screening steps up to the current 
replication to ensure validity of the screening. (If screening on a strict subsequence of replications, 
it may be suffici ent to c ommunic ate summary statistics.) Such “catch-up” screening was used, for 
instance, in Pi chitlamken et al. ( 20061 ). in a different context. In fj3j we employ a probabilistic 


bound that removes the need for catch-up screening in GSP. 

Besides core hours, distributing screening across workers also saves memory space on the master. 
In our implementation of GSP, the master keeps a complete copy of batch statistics only for a small 
number of systems that are estimated to be the best. For a system that is not among the best, 
the master acts as an intermediary, keeping statistics for only the most recent batches that have 
not been collected by a worker. Whenever some batch statistics are sent to a worker (for screening 


in Steps [^cj] or [f ](d) of GSP), they can be deleted on the master. This helps to even out memory 
usage across cores, making the procedure capable of solving larger problems without the need to 
use slower forms of storage. 


2.3 Random Number Stream Management 

The validity and performance of simulation experiments and simulation optimization procedures 
relies substantially on the quality and efficiency of (pseudo) random num ber ge nerators. For a 
discussion of random number generators and their desirable properties, see L’Ecuyer (2006)■ 

To avoid unnecessary synchronization, each core may run its own random number generator 
independently of other cores. Some strategies for generating in dep endent random numbers in 
parallel have been proposed in the literature. Masca gni and Srin ivasan (2000) consider a class of 
random number generators which are parametrized so that each valid parametrization is assigned 
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to one core. 


Karl et al. 


(2014) adopt iL’Ecuver et al. (2002)’s RngStream package, which supports 
streams and substreams, and demonstrated a way to distribute RngStream objects across parallel 
cores. 

Both methods set up parallel random number generation in such a way that once initialized, 
each core will be able to generate a unique, statistically independent stream of pseudo random 
numbers, which we denote as U w , for each w = 1,2,..., c. If a core has to switch between systems 
to simulate, one can partition U w into substreams {U^ : i = 1,2,. .., k}, simulating system i using 
U^ only. It follows that for any system i, U l w for different w are independent as they are substreams 
of independent U w 's, so simulation replicates generated in parallel with {U^ : w = 1,2,..., c} are 
also i.i.d. Moreover, if it is desirable to separate sources of randomness in a simulation, it may help 
to further divide U^ into subsubstreams, each used by a single source of randomness. 

In practice, one does not need to pre-compute and store all random numbers in a (sub)stream, 
as long as jumping ahead to the next (sub)stream and switching between different (su b)str e ams 


are fast. Such operations are easily achievable in constant computational cost; see lL’Ecuver et al . 
(]20Q2) for an example. 

Although our procedure does not support the use of common random numbers (CRN), it is worth 
noting that the above framework easily extends to accommodate CRN as follows. Begin by having 
one identical stream Uq set up on all cores and partitioning it into substreams {Uq(£) : 1 < £ < L} 
for sufficiently large L. Let the master keep variables {£i : i = 1,2,... ,k} which count the total 
number of replications already generated for system i over all workers. Each time the master 
initiates a new replication of system i on a worker, it instructs the worker to simulate system i 
using substream {Uo(£i + 1)} and adds 1 to This ensures that for any £ > 0, the It h replication 
of every system is generated by the same substream {Uq(£)}. 


3 The Parallel Good Selection Procedure 


In this section, we provide a R&S procedure GSP that incorporates the design principles from 
(21 and is implementable on a wide spectrum of parallel platforms. Our procedure applies to the 
general case in which the system mean and variance are both unknown and need to be est imated 
(an earlier version of the procedure under the known variance case is discussed in Ni et al. 2014|), 


and does not permit the use of common random numbers. We prove that the procedure offers a 
PGS guarantee for normally distributed observations. 


3.1 The Setup 

GSP consists of four broad stages. In an optional Stage 0, workers run no simulation replications 
for each system in parallel to estimate completion times, which are subsequently used to try to 
balance the workload. As discussed in H2.2.21 Stage 0 samples are then dropped and not used to 
form estimators of /V s due to the potential correlation between simulation output and completion 
time. In Stage 1, a new sample of size n\ is collected from each system to obtain variance estimates 
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Sf = Y^LiiXu — Xi{ni)) 2 /{n\ — 1), where Xi{n) = Xu/n. Prior to Stage 2, obviously 
inferior systems are screened. In Stage 2, the workers iteratively visit the remaining systems and 
run additional replications, exchange statistics and independently perform screening over a subset 
of systems until either all but one are eliminated, or a pre-specified limit on sample size is reached. 
The screening rule and the limit on sample size are jointly chosen such that inferior systems can be 
eliminated efficiently, while the best system k survives this stage with high probability regardless of 
the configuration of true means fii, ... ,Hk- Finally, in Stage 3, all systems surviving Stage 2 enter 
a Rinott (1978) procedure where a maximum sample size is calculated, additional replications are 


simulated if necessary, and the system with the highest sample mean is selected as the best. 

The sampling rules used in Stages 0, 1, and 3 are relatively straight forward, for they each 
require a fixed number of replications from each system. In Stage 2, where the procedure iteratively 
switches between simulation and screening, a sampling rule needs to be specified to fix the number 
of additional replications to take from each system before each round of screening. Prior to the start 
of the overall selection procedure we define increasing (in r) sequences { m(r ) : i = 1, 2,... , k, r = 
0,1,...} giving the total number of replications to be collected for system i by batch r, and let 
rii( 0) = ni since we include the Stage 1 sample in mean estimation. Following the discussion in 
112.2.21 where we recommend that batch size for system i be proportional to Si/sfTi in order to 
efficiently allocate simulation budget across systems, we use 


nj(r) = n\ + r 



(3) 


where 7) is an estimator for simulation completion time of system i obtained in Stage 0 if available, 
and (3 is the average batch size and is specified by the user. 

The parameters for the procedure are as follows. Before the procedure initiates, the user selects 
an overall confidence level 1 — a, type-I error rates aq, «2 such that a = aq+a^i an indifference-zone 
parameter 6, Stage 0 and Stage 1 sample sizes no,ni > 2, and average Stage 2 batch size (3. The 
user also chooses r > 0 as the maximum number of iterations in Stage 2, which governs how much 
simulation budget to spend in iterative screening before moving to indifference-zone selection in 
Stage 3. 

Typical choices for error rates are aq = «2 = 0.025 for guaranteed PGS of 95%. The indifference- 
zone parameter 5 is usually chosen within the context of the application, and is often referred to 
as the smallest difference worth detecting. The sample sizes no and n\ are typically chosen to be 
small multiples of 10, with the view that these give at least reasonable estimates of the runtime 
per replication and the variance per replication. 

For non-normal simulation output, we recommend setting (3 > 30 to ensure normally distributed 
batch means. The parameter f3 also helps to control communication frequency so as not to over¬ 
whelm the master with messages. Let T s i m be a crude estimate of the average simulation time 
(in seconds) per replication, perhaps obtained in a debugging phase. Then ideally the master 
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communicates with a worker every /3T s [ m /c seconds. If every communication takes T comm seconds, 
the fraction of time the master is busy is p = cT comm / /3T s ; m . We recommend setting (3 such that 
p < 0.05, in order to avoid significant waiting of workers. 

We recommend choosing f such that a fair amount of simulation budget (no more than 20% of 
the sum of Rinott sample sizes) will be spent in the iterative screening stages. Note that a small f 
implies insufficient screening whereas a large f may be too conservative. 

Under these general principles, our choices of (/? = 100, f = 10) and (/3 = 200, r = 5) in the 
experiments in 2] work reasonably well on our testing platform, but it is conceivable that other 
values could improve performance. 

Finally, we define some quantities used in the iterative screening stages. Let p be the solution 
to 


E 


2<b pVR 


= l-(l-ai) 


1 

fc-1 


(4) 


where <!>(■) denotes the complementary standard normal distribution function, and R is the mini¬ 
mum of two i.i.d. x 2 random variables, each with n\ — 1 degrees of freedom. Let the distribution 

function and density of such a x 2 random variable be denoted F v i (x) and / 2 (x), respectively. 

—1 — 1 

Hence R has density /r(x) = 2[1 — F^. ^(x)\f x 2 i (x). Also, for any two systems i ^ j, define 
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( r) = V\/(ni - 1 )Tij{r). 


Zij(r) = tij(r)[Xi(ni(r)) - X^n^r))}, 


Yij(r) = Tij(r)[Xi(ni(r)) - Xj(nj{r))], 


3.2 Good Selection Procedure under Unknown Variances 

1. (Stage 0), optional Master assigns systems to workers, so that each system i is simulated 
for no replications and the average simulation completion time Tj is reported to the master. 

2. (Stage 1) Master assigns systems to load-balanced (using T % if available) simulation groups 
Gf for w = 1,..., c. Let 1 <r- S be the set of surviving systems. 

3. For w = 1, 2,... , c in parallel on workers: 

(a) Sample Xu, £ = 1,2,..., n\ for all i € G™. 

(b) Compute Stage 1 sample means and variances X t (n \) and S' 2 for i E Gf. 

(c) Screen within group Gf: system i is eliminated (and removed from X) if there exists a 
system j E Gf : j % i such that Uj(0) < ~o-ij(r). 

(d) Report survivors, together with their Stage 1 sample means W(nj(0)) and variances S’ 2 , 
to the master. 
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4. (Stage 2) Let G\ <— X be the set of systems surviving Stage 1. Master computes sampling rule 

(131) using Sf obtained in Stage 1, and divides G\ to approximately load-balanced screening 
groups Gf for w = 1, Let Si <— 0,i € G\ be the count of the number of batches 

simulated in Stage 2 for system i. 

5. For w = 1, 2,... , c in parallel on workers, let r w •(— 0 be the count of the number of batches 
screened on worker w (which is common to all systems in the screening), and iteratively switch 
between simulation and screening as follows (this step entails some communication with the 
master, the details of which are omitted): 


(a) Check termination criteria with the master: if \X\ = 1 (only one system remains) or 
r w > f for all w (each worker has screened up to f, the maximum number of batches 
allowed), go to Stage 3; otherwise continue to Step [ ffib)j 

(b) Decide to either simulate more replications or perform screening based on available 
results: check with the master if the (r w + l)th iteration has completed for all systems 
i € Gif and \Gif\ > 1, if so, go to Step [ ffidj] otherwise go to Step iH 

(c) Retrieve the next system i in G\ (not necessarily Gif) from the master and simulate it 
for an additional nfisi + 1) — rii(si) replications. Set s* <— Si + 1. Report simulation 
results to the master. Go to Step [ Jfa)| 

(d) Screen within Gif as follows. Retrieve necessary statistics for systems in Gf from the 
master (recall that a system in Gif is not necessarily simulated by worker w). Let r w «— 
r w + 1. A system i € Gf is eliminated if r w < f and there exists a system j € Gf : j i 
such that Yij(r w ) < —aijif). Also use a subset of systems from other workers, e.g., those 
with the highest sample mean from each worker, to eliminate systems in Gf. Remove 
any eliminated system from Gf and X. Go to Step Eft a) [ 


6. (Stage 3) Let G^ 4— X be the set of systems surviving Stage 2. If k' := | C? 2 1 = 1, select 
the single system in G 2 as the best. Otherwis e, set h = h{ 1 — ot 2 ,ni, h r ), where the function 


h(-) gives Rinott’s constant (see e.g. Bechhofer et al. 1995, Chapter 2.8). For each remaining 
system i € G 2 , compute Ni = max{rij(f), \(hSi/5 ) 2 ~\}, and take an additional max{IVj — 
n,(f), 0} sample observations in parallel. Once a total of Ni replications have been collected 
in Stages 1 through 3 for each i € G 2 , select the system K with the highest X(Nk) as the 
best. 


3.3 Guaranteeing Good Selection 

Our probabilistic guarantee on the final solution relies on the following assumption on the distri¬ 
bution of simulation output, which is common to the sequential R&S literature. 

Assumption 4. For each system i = 1,2,... ,k, the simulation output random variables {Xn,i = 
1,2,...} are i.i.d. replicates of a random variable X\ having a normal distribution with finite mean 
Hi and finite variance of, and are mutually independent for different i. 
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In ^2.1l we gave conditions under which the simulation output generated by parallel cores satisfies 
the i.i.d. assumption. 

We now formally state our good selection guarantee. 

Theorem 1 . Under Assumption GSP selects a system K that satisfies pk — UK < 8 with 
probability at least 1 — a. 

We discuss the key insights that yield the PGS guarantee here and defer the full proof of 
Theorem |Tj to i]A.ll 


Sketch of Proof. First, we show that the best system, System k, survives the iterative screening in 
Stages 1 and 2 with probability at least 1 — aq, irrespective of whether the best solution is unique 
or not. Indeed, conditioning on the Stage 1 variance estimates {Sf : 1 = 1,2,... ,/c}, we can, for 
any system i k, relate the batch statistics Zkfir) : r = 0,1,... , f to a properly scaled Brownian 
motion with drift Pk — Pi > 0- Then, using the reflection principle of Brownian motion, we can 
upper-bound the probability that the scaled Brownian motion falls below some number —a before 
some time t, or equivalently, the probability that Ykfir) falls below —akfir) in some rth iteration 
where r < r, which is the criterion used to eliminate system k in the iterative screening stages. 
The construction of continuation region parameter r] and the fact that ( n\ — \)Sf /af ~ Xni-l f° r 
all i jointly ensure that the unconditional probability of eliminating k is no greater than a\. 

Second, as Stage 3 is closely related to th elRin ott (119781 ) procedure with confidence level 1 — 02 , 
it follows from Theorem 1 of 


Ne lson and Mateicik (1 999 ) that 


P { X k (N k ) - XfiNi) - (ji K - m) > -(5, Vz € G 2 : i + K} > 1 - o 2 


where K is the system with the highest sample mean after Stage 3. Therefore we conclude that if k € 
G 2 (the best system k survives Stages 1 and 2), then P {Xk{Nk) — Xk(Nk) — (pk ~ Uk) > > 


1 — o 2 , that is, Stage 3 selects a good system K such that u nj > U k — 8 wit 
Finally, we complete the proof by invoking a result from 
an overall PGS of 1 — ctq — o 2 for two-stage procedures. 


Nelson et al. 


i high probability. 


12001) that guarantees 


□ 


The key difference between the screening methods used i n GSP and the KJV family procedures 


([Kim and Nelson 


2001 


Hong and Nelsonl 120051 . iHond 120061 1 is that the ICJ\f family relies on the 


PCS assumption (//& — Pi > S > 0 for all i k ) to guarantee PCS, whereas our approach does not. 
Therefore, GSP works for any indifference-zone parameter 5 > 0, and when there exist multiple 
systems i such that Pi > Pk ~ 8, GSP is guaranteed to select one such system with probability at 
least l — a. 


3.4 Choice of parameter p 

One way to compute rj, the solution to Q, is by integrating the LHS using Gauss-Laguerre quadra¬ 
ture and using bisection to find the root of (0]) . Alternatively, we may employ a bounding technique 
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to approximate 77 as follows. Indeed, the LHS of © is 
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where © holds because distribution functions are non-negative and is inspired by a similar argu¬ 


ment in 


Hong (2006), © follows from the fact that <j>(x) < e~ x / 2 /(xx/Zk) for all x > 0, and the 
integrand in © is the pdf of a Gamma distribution with shape (n\ — l )/2 and scale 2 /(rj 2 + 1), 
and hence integrates to 1 . 

Note that © is an upper-bound on the left-hand side of ©. Setting © to 1 — (1 — a \and 
solving for 77 yields an overestimate ?/, which is more conservative and does not reduce the PGS. 
Furthermore, as © is strictly decreasing in 77 , 77 ' can be easily determined using bisection. 

The parameter 77 determines the value of Ojj(f), and hence how quickly an inferior system 
is eliminated in screening Steps ©c)| and [f (d) The value of 77 therefore directly impacts the 


effectiveness of the iterative screening. Hence, it is desirable that 77 does not grow dramatically as 
the problem gets bigger. Observe that © can be further bounded by 


2T( 


ni—2 i 


— rvd-m 


% /7rT(2i2 J -)?7 n i 


-:= C 77 


(9) 


Setting © to 1 — (1 — ai) k ~ 1 implies that the right-hand side of 
further manipulations we have 


must be small. After some 


log(l — ai) = {k — 1 ) log (l — C77 1 ni ) « (k — 1 )(— C77 1 ni ) 


( 10 ) 


where the approximation holds because log(l — e) ~ —e for small e > 0. It follows from (1101) that 
for fixed aq, the parameter 77 grows very slowly with respect to k, at a rate of fc 1 /!”! -1 ). Therefore, 
the continuation region defined by 77 and r as well as the power of our iterative screening are not 
substantially weakened as the number of systems increases, especially when ni or k is large. In this 
regime, we should expect the total cost of this R&S procedure to grow approximately linearly with 
respect to the number of systems. 


15 































4 Computational Study 


In this section, we discuss our parallel computing environment and test problem, discuss two parallel 
implementations of GSP, and discuss the results of our numerical experiments. 


4.1 Parallel Computing Environment and Test Problem 


Our numerical experiments are conducted on Extreme Science and Engineering Discovery Environ¬ 
ment (XSEDE)’s Stampede high-performance cluster. The Stampede cluster contains over 6,400 
computer nodes, each equipped with two 8 -core Intel Xeon E5 pro cessors a nd 32 GB_of memory 
and runs a Linux Centos 6.3 operating system (Texas Advanced Computing Center 2014). Paral¬ 
lel programs are submitted through the Simple Linux Utility for Resource Management (SLURM) 
batch environment which enables users to specify the number of cores to use. The high-performance 
processors on Stampede are highly reliable and we have never seen a core failure. 

We test R&S pro ced ures on a throughput-maximization problem taken from SimOpt.org 
({Henderson and Pasu pathv 2014). In this problem, we solve 


max E\g(x;£)} 

X 

s.t. n + r 2 + r 3 = R 
b 2 + b 3 = B 

x = (n, r 2 , r 3 , b - 2 , b 3 ) S { 1 , 2 ,.. ,} 5 


( 11 ) 


where the function g{x \£) represents the random throughput of a three-station flow line with finite 
buffer storage in front of Stations 2 and 3, denoted by b 2 and 63 respectively, and an infinite 
number of jobs in front of Station 1. The processing times of each job at stations 1, 2, and 3 are 
independently exponentially distributed with service rates rq, r 2 and r 3 , respectively. The overall 
objective is to maximize expected steady-state throughput by finding an optimal (integer-valued) 
allocation of buffer and service rate. 

For each choice of the problem parameters R, B £ Z + , the number of feasible allocations is 
finite and can be easily computed. We consider three problem instances with very different sizes 
presented in Table |T] Since the service times are all exponential, we can analytically compute 
the expected throughput of each feasible allocation by modeling the system as a continuous-time 
Markov chain. Furthermore, it can be shown that E[g(ri,r 2 ,r 3 ,b 2 ,b 3 ; £)\ = E[g(r 3 , r 2 , n, 63 , b 2 ] £)] 
for any feasible allocation (ri,r 2 ,r 3 ,b 2 ,b 3 ), so the problem may have multiple optimal solutions. 
Therefore, this is a problem for which the PCS assumption g,/, — Hk-i > 5 > 0 can be violated and 
R&S procedures that only guarantee correct selection might be viewed as heuristics. 

By default, in each simulation replication, we warm up the system for 2,000 released jobs starting 
from an empty system, before observing the simulated throughput to release the next 50 jobs. This 
may not be the most efficient way to estimate steady-state throughput compared to taking batch 
means from a single long run, but it suits our purpose which is to obtain i.i.d. random replicates 
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Table 1: Summary of three instances of the throughput maximization problem. 


Number of Highest pth. percentile of system means No. of systems in [pk — 5, pk\ 

( R , B) systems k mean /r*. p = 75 p = 50 p = 25 5 = 0.01 <5 = 0.1 <5 = 1 


(20,20) 

3,249 

5.78 

3.52 

2.00 

1.00 

6 

21 

256 

(50,50) 

57,624 

15.70 

8.47 

5.00 

3.00 

12 

43 

552 

(128,128) 

1,016,127 

41.66 

21.9 

13.2 

6.15 

28 

97 

866 


from the g(x ; £) distribution in parallel. Due to the fixed number of jobs, the wall-clock time for 
each simulation replication exhibits low variability. 

4.2 Parallel Implementations of GSP 

In this section, we discuss two implementations of GSP proposed in f|3j Although we will primarily 
test them on Stampede, both procedures can be configured to run on a wide range of parallel 
platforms from multi-core personal computers to the Amazon EC2 cloud. 


4.2.1 MPI 


Message-Passing Interface (MPI) is a popular distributed-memory parallel programming framework 
with libraries available in C/C+-1- and Fortran and the de-facto standard for parallel programming 
on many high-performance parallel clusters including Stampede. Using MPI, programs operate in 
an environment where Assumptions 1 and 2 hold. The method by which parallel cores independently 
execute instructions and communicate through message-passing can be highly customized to serve 
different purposes. 

The MPI implementation is a realization of GSP presented in and is fully described in 
4A.2I We designate one core as the master and let it control other worker cores. We observe that 
communication is fast on Stampede, taking only 10 -6 to 1CD 4 seconds each time depending on 
the size of the message. Therefore, with an appropriate choice of the batch-size parameter /3, the 
master remains idle most of the time so the workers are usually able to communicate with the 
master without much delay. 

Our MPI implementation is designed primarily for high-performance clusters like Stampede and 
does not detect and manage core failures. As simulation output is distributed across parallel cores 
without backup, the MPI implementation is vulnerable to core failures which may cause loss of 
data and break the program. In practice, such failures can occur for a number of reasons, including 
faulty hardware but also cores aborting due to being re-assigned to higher priority tasks by the 
system. Therefore, for cheap and less reliable parallel platforms, the MPI implementation needs to 
be augmented with a “fault-tolerant” mechanism in order to allow the procedure to continue even 
if some cores fail. This motivates us to seek alternative programming tools such as MapReduce 
that handle core failures automatically. Our MPI implementation uses the m vapicli2 library and 
its source code and documentation is hosted in the open-access repository Ni (J2015b). 
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4.2.2 Hadoop MapReduce 


MapReduce (IDean an d Ghema wat 2008) is a distributed computing model typically used to process 
large amounts of data. Conceptually, each MapReduce instance consists of a Map phase where data 
entries are processed by “Mapper” functions in parallel, and a Reduce phase where Mapper outputs 
are grouped by keys and summarized using parallel “Reducer” functions. MapReduce has become 
a popular choice for data intensive applications such as PageRank and TeraSort, thanks to the 
following advantages. 


• Simplicity. The MapReduce programming model allows its users to solely focus on design¬ 
ing meaningful Mappers and Reducers that define the parallel program, without explicitly 
handling the complex details of the message-passing and the distribution of workload to cores, 
a task which is completely delegated to the MapReduce package. 

• Portability. Hadoop is a highly popular and portable MapReduce system that can be easily 
deployed with minimal changes on a wide range of parallel computer platforms such as the 
Amazon EC2 cloud. 

• Resilience to core failures. On less reliable hardware where there is a non-negligible 
probability of core failure, the Apache Hadoop system can automatically reload any failed 
Mapper or Reducer job on another worker to guide the parallel job towards completion. 


Despite these advantages, the use of MapReduce for computationally intensive and highly iter¬ 
ative applications, such as simulation and simulation optimization, is less documented. Moreover, 
most popular MapReduce implementations such as Apache Hadoop have limitations that may 
potentially reduce the efficiency of highly iterative algorithms such as parallel R&S procedures. 


• Synchronization. By design, each Reduce phase cannot start before the previous Map 
phase finishes and each new MapReduce iteration cannot be initiated unless the previous 
one shuts down completely. Hence, a R&S procedure using MapReduce has to be made 
fully synchronous. If load-balancing is difficult, for instance as a result of random simulation 
completion times, then core hours could be wasted due to frequent synchronization. 


• Absence of Cache. In Apache Hadoop, workers are not allowed to cache any information 
between Map and Reduce phases. As a result, the outputs generated by Mappers and Reduc¬ 
ers are often written to a distributed hie system (which are usually located on hard drives) 
before they are read in the next iteration. Compared to the MPI implementation where all 
intermediate variables are stored in memory, the MapReduce version could be slowed down 
by repeated data I/O. Moreover, the lack of cache requires the simulation program, including 
any static data and/or random number generators, to be initialized on workers before every 
MapReduce iteration. 


• Nonidentical Run Times. By default, Apache Hadoop does not dedicate each worker to 
a single task. It may simultaneously launch several Mappers and Reducers on a single worker, 
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Table 2: Major differences between GSP implementations 


Task 

MPI 

Hadoop MapReduce 

Master 

Explicitly coded 

Automated 

Message-passing 

Explicitly coded 

Automated 

Synchronization 

Once after each stage 

More frequent: required in every 
iteration of Stage 2 

Simulation 

Each worker simulates one system 
per iteration 

Each worker simulates multiple 
systems per iteration 

Load-balancing 

Via asynchronous 
communications between the 
master and a single worker 

By assigning approximately equal 
number of systems (Mappers) to each 
worker in each synchronized iteration 

Batch statistics and 
random number seeds 

Always stored in workers’ memory 

Written to hard disk after each 
iteration 


run multiple MapReduce jobs that share workers on the same cluster, or even use workers 
that have different hardware configurations. In any of theses cases, simulation completion 
times may be highly variable and time-varying. Therefore, Stage 0 of GSP (estimation of 
simulation run time) is dropped from our MapReduce implementation. 


Although there are specializ ed Map Reduce variants such as “stateful MapReduce” that attempt 
to address these limitations (Elgohary 2012), we do not explore them as they are less available for 
general parallel platforms, at least at present. However, some of these limitations (such as the lack 
of caching across multiple MapReduce rounds) are idiosyncratic to specific packages like Hadoop 
rather than the framework itself. Nevertheless, the a priori expectation is that, for a highly iterative 
procedure like ours, a highly optimized MPI approach will outperform a Hadoop one; thus our 
question is not which is fastest, but whether MapReduce can offer most of the speed of MPI along 
with its advantages discussed above. 

We propose a variant of GSP using iterative MapReduce as follows. In each Mapper function, 
we treat each surviving system as a single data entry, obtain an additional batched sample, and 
output updated summary statistics such as sample sizes, means, and variances. Each output entry 
is associated with a key which represents the screening group to which it belongs. Once output 
entries of Mappers are grouped by their keys, each Reducer receives a group of systems, screens 
amongst them, and writes each surviving system as a new data entry which in turn is used as the 
input to the next Mapper. 

To fully implement GSP, MapReduce is run for several iterations. The first iteration implements 
Stage 1, where both X t (ri\ ) and Sf are collected. Then, a maximum number of f subsequent 
iterations are needed for Stage 2, with only n*(r) and Xi(rii(r)) being updated in each iteration. 
(Additional MapReduce iterations can be run where the best system from each group is shared for 
additional between-group screening.) The same Reducer can be applied in both Stages 1 and 2, as 
the screening logic is the same. Finally, a Stage 3 MapReduce features a Mapper that calculates 
the additional Rinott sample size, simulates the required replications, and a different Reducer that 
simply selects the system with the highest sample mean at the end. Full details of each stage are 
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Table 3: A comparison of procedure costs using parameters no = 20, n\ = 50, <x\ = 0.2 = 2.5%. 
/3 = 100, f = 10. (Results to 2 significant figures) 


Configuration 

6 

Procedure 

Wall-clock time (sec) 

Total number of 
simulation replications (xlO 6 ) 

3,249 systems 

0.01 

GSP 

14 

2.3 

on 64 cores 


NHH 

14 

2.5 



NSGSp 

120 

13 


0.1 

GSP 

3.4 

0.57 



NHH 

2.6 

0.44 



NSGSp 

3.4 

0.48 

57,624 systems 

0.01 

GSP 

720 

130 

on 64 cores 


NHH 

520 

89 



NSGS P 

11,000 

1600 


0.1 

GSP 

60 

10 



NHH 

71 

12 



NSGSp 

150 

23 

1,016,127 systems 

0.1 

GSP 

260 

320 

on 1,024 cores 


NHH 

1,000 

430 



NSGSp 

1,400 

1900 


provided in f]A.3l 

Our MapReduce implementation is based on the native Java interf ace for MapReduce provided 
in Apache Hadoop 1.2.1. It is hosted in the open-access repository Ni (2015ai). Table [2] summarizes 
some of the major differences between the MPI and Hadoop implementations. 


4.3 Numerical Experiments 

We now demonstrate the practical performance of GSP by using it to solve the throughput maxi¬ 
mization test problem. 


4.3.1 GSP vs Existing Parallel Procedures 

GSP is motivated by an earlier computational study by 


formance of two parallel procedures, NHH (Met ah 2013 ) and NSGSp (Ni et ah 2014). NHH is a 


Ni et ah 


(120141 ) , which compa res the per 


parallel procedure that adopts the fully-sequential ser ial procedure proposed in iHond (|2006l ) , and 


only provides a PCS guarantee (Hong and Nelson 2014). For problems where multiple optimal solu¬ 
tions exist, it is used as a heuristic because the PCS assumption does not hold. NHH can be viewed 
as a variant of GSP using a different screening method and without a_ Rinott-like Stage 3. NSGSp 
is a parallel implementation of the NSGS procedure ([N elson et ah 2001), and is a simplification of 
GSP without the iterative screening Stage 2. 

We implement all three procedures using MPI and test them on different instances of the 
throughput maximization problem. We measure the performance of these procedures in terms 
of total wall-clock time and simulation replications required to find a solution, and report them 
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Table 4: A comparison of two implementations of GSP using parameters S = 0.1, no = 50, 
a\ = 0.2 = 2.5%, f = 1000//3. “Total time” is summed over all cores. (Results to 2 significant 
figures) 


Configuration 

P 

Version 

Number of 

Wall-clock 

Total time 

Utilization 




replications 

time 

Simulation 

Screening 





(xlO 6 ) 

(sec) 

(xlO 3 sec) 

(sec) 

% 

3,249 systems 

100 

HADOOP 

0.46 

460 

0.34 

0.14 

1.2 

on 64 cores 


MPI 

0.50 

3.0 

0.18 

0.01 

94 


200 

HADOOP 

0.63 

280 

0.41 

0.10 

2.3 



MPI 

0.69 

4.1 

0.25 

0.01 

95 

57,624 systems 

100 

HADOOP 

8.8 

550 

5.1 

1.9 

15 

on 64 cores 


MPI 

9.1 

53 

3.3 

0.89 

98 


200 

HADOOP 

12 

410 

7.0 

1.7 

27 



MPI 

13 

75 

4.7 

0.83 

98 

1,016,127 

100 

HADOOP 

280 

1300 

160 

120 

12 

systems 


MPI 

320 

120 

110 

30 

91 

on 1,024 cores 

200 

HADOOP 

340 

810 

190 

89 

23 



MPI 

380 

140 

140 

29 

97 


in Table [3l Preliminary runs on smaller test problems suggest that the variation in these two 
measures between multiple runs of the entire selection procedure are limited. Therefore we only 
present results from a single replication to save core hours. 

(2014) argue that NHH tends to devote excessive simulation effort to systems with 


Ni et al. 


means that are very close to the best, whereas NSGS P has a weaker screening mechanism but its 
Rinott stage can be effective when used with a large 6 , which is associated with higher tolerance of 
an optimality gap. GSP, by design, combines iterative screening with a Rinott stage. Like NSGS P , 
we expect that GSP will cost less with a large 6 as the Rinott sample size is 0(1/5 2 ), but its 
improved screening method should eliminate more systems than NSGS P before entering the Rinott 
stage. Therefore, we expect GSP to work particularly well when a large number of systems exist 
both inside and outside the indifference zone. This intuition is supported by the outcomes of the 
medium and large test cases with 5 = 0.1, when GSP outperforms both NHH and NSGS p . 


4.3.2 A Comparison of MPI and Hadoop Versions of GSP 

We now focus on GSP and test its two implementations discussed in 114.21 Since Stage 0 is not 
included in the MapReduce implementation, we also remove it from the MPI version to have a fair 
comparison. Both procedures are tested on Stampede. While the cluster features highly optimized 
C++ compilers and MPI implementations, it provides relatively l ess supp ort for MapReduce. Our 
MapReduce jobs are deployed using the myhadoop software iLockwoodl 2014 ). which sets up an 
experimental Hadoop environment on Stampede. 

Another difference is that we perform less screening in MPI than in Hadoop. In our initial 
experiments, we observed that the master could become overwhelmed by communication with the 


21 















Figure 2: A profile of a MapReduce run solving the largest problem instance with k = 
1, 016,127 on 1024 cores, using parameters op = «2 = 2.5%, S = 0.1, /3 = 200, f = 5. 


workers in the screening stages, and we fixed this problem by screening using only the 20 best 
systems from other cores, versus the best systems from all other cores in Hadoop. While less 
screening is not a non-negligible effect, it will be apparent in our results that it is dominated by 
the time spent with simulation. 

Before we proceed to the results, we define core utilization, an important measure of interest, 
as 


Utilization 


total time spent on simulation 
wall-clock time x number of cores 


Utilization measures how efficiently the implementations use the available cores to generate simu¬ 
lation replications. The higher the utilization, the less overhead the procedure spends on commu¬ 
nication and screening. 

In Table [4] we report the number of simulation replications, wall-clock time, and utilization for 
each of the GSP implementations. The MPI implementation takes substantially less wall-clock time 
than MapReduce to solve every problem instance, although it requires slightly more replications 
due to its asynchronous and distributed screening. The gap in wall clock times narrows as the batch 
size /3 and/or the system-to-core ratio are increased. Similarly, the MPI implementation also yields 
much higher utilization, spending more than 90% of the total computation time on simulation runs 
in all problem instances. Compared to the MPI implementation, the MapReduce version utilizes 
core hours less efficiently but again its utilization significantly improves as we double batch size 
and increase the system-to-core ratio. 

To further understand the low utilization, we give the number of active Mapper and Reducer 
jobs over an entire MapReduce run in Figure [2J The plot reveals a number of reasons for low 
utilization. First, there are non-negligible gaps between Map and Reduce phases, which are due to 
an intermediary “Shuffle” step that collects and sorts the output of the Mappers and allocates it 
to the Reducers. Second, as the amount of data shuffled is likely to vary, the Reducers start and 
finish at different times. Third, owing to the varying amount of computing required for different 
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Table 5: A comparison of GSP implementations using a random number of warm-up job 
releases distributed like min{exp(X), 20, 000} , where X ~ N(/j,,a 2 ). We use parameters S = 
0.1, no = 50, ai = CU 2 = 2.5%, /? = 200, f = 5. (Results to 2 significant figures) 


Configuration 


u 2 

Version 

Wall-clock time (sec) 

Utilization % 

3,249 systems 

7.4 

0.5 

HADOOP 

280 

2.3 

on 64 cores 



MPI 

4.2 

94 


6.6 

2.0 

HADOOP 

280 

2.0 




MPI 

4.0 

93 

57,624 systems 

7.4 

0.5 

HADOOP 

400 

27 

on 64 cores 



MPI 

74 

98 


6.6 

2.0 

HADOOP 

400 

26 




MPI 

70 

98 

1,016,127 systems 

7.4 

0.5 

HADOOP 

850 

25 

on 1,024 cores 



MPI 

150 

97 


6.6 

2.0 

HADOOP 

850 
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MPI 

150 

97 


systems, some Mappers take longer than others. In all, the strictly synchronized design of Hadoop 
causes some amount of core idleness that is perhaps inherent in the methodology, and therefore 
unavoidable. Nevertheless, the fact that utilization increases as average batch size (3 or the system- 
to-core ratio increases suggests that the Hadoop overhead becomes less pronounced as the amount 
of computation work per Mapper increases. Therefore we expect utilization to also improve and 
become increasingly competitive with MPI’s for problems that feature a larger solution space or 
longer simulation runs. 

4.3.3 Robustness to Unequal and Random Run Times 

The MapReduce implementation allocates approximately equal numbers of simulation replications 
to each Mapper and the simulation run times per replication are nearly constant for our test 
problem, so the computational workload in each MapReduce iteration should be fairly balanced. 
Indeed, in Figure [2] we observe that Mapper jobs terminate nearly simultaneously, which suggests 
that load-balancing works well. However, if the simulation run times exhibit enough variation that 
one Mapper takes much longer than the others, then we would expect synchronization delays that 
would greatly reduce utilization. 

To verify this conjecture, we design additional computational experiments where variability in 
simulation run times is introduced by warming up each system for a random number W of job 
releases (by default, we use a fixed 2,000 job releases in the warm-up stage). We take W to be 
(rounded) log-normal, parameterized so that the average warm-up period is approximately 2,000, 
in the hope that the heavy tails of the log-normal distribution will lead to occasional large run times 
that might slow down the entire procedure. We also truncate the log-normal distributions from 
above at 20,000 job releases to avoid exceeding a built-in timeout limit in Hadoop. Parameters of 
the truncated log-normal distribution and the results of the experiment are given in Table [5j 
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We observe very similar wall-clock time and utilization in all instances compared to the base 
cases in Table |4] where we used fixed warm-up periods. Both implementations seem quite robust 
against the additional randomness in simulation times, despite our intuition that the MapReduce 
version might be noticeably impacted due to additional synchronization waste. A potential ex¬ 
planation is that as each core is allocated at least 50 systems and each system is simulated for 
an average of 200 replications in each step, the variation in single-replication completion times is 
averaged out. Rather extreme variations would be required for MapReduce to suffer a sharp per¬ 
formance decrease. For problems with much longer simulation times and a lower systenrs-to-core 
ratio, the averaging effect might not completely cancel the variations across simulation run times. 
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Appendices 


A.l Proof of Theorem [0 

Proving Theorem [T] requires the following lemmas, where we use £>a(') to denote a Brownian motion 
with drift A and volatility one. 


Lemma 1. Hong 2006, Theorem 1) Let m{r ) and n(r) be arbitrary nondecreasing integer-valued 

T o l -1 

functions ofr = 0,1,... and i, j be any two systems. Define Z(m , n ) := of /m + cr|/n pQ(m) — 
Xj(n)] and Z'(m,n ) := ([a 2 /m + cr|/n] -1 ). Then the random sequences {Z(m(r ), n(r)) : 

r = 0,1,...} and {Z' (m(r) , n(r)) : r = 0,1,...} have the same joint distribution. 

Lemma 2. Let i j be any two systems. D efine djfir ) := min{Sf /af, S?/a 2 Aaij{r) and Uj(r) := 


min {S?/crf,Sj/(Tj}Tij(r). It can be shown (Hone 


2006) that mm{Sf/af,S 2 /aj} < Lfij)/T tJ (r) for 


allr > 0 regardless of the sampling rulesnfi-) andnj(-). Therefore dij(r ) < Uj(r)aij(r) / Tij{r) and Uj{r) < 
^ij( r ) T ij(^)/ T ij(r) regardless of the sampling rules nfi-) and nj(-) for all r > 0. 


Lemma 3. (Hong 20 0 6, Lemma 4) Let gi(-), be two non-negative-valued functions such that 

g 2 (t') > gift') for all t! > 0. Define symmetric continuations C m := : —gm{t') < x < 

and let T m := inf{t' : B\(t') 0 C m } form = 1,2. If A > 0, then P[B^(Ti) < 0] > P[B/^(T- 2 ) < 0]. 

Lemma 4. By the reflection principle of Brownian motion, P[mino<t'<t B$(t') < —a] = 2P[Bo(t) < 
—a] = 24>(a/Vt) for all a,t > 0. 


Lemma 5. 


Tamhane 


1911 ) Let V\, V 2 , ■ ■ ■, Vk be independent random variables, and let G w (yi,v 2 ,..., Vk), 


j = 1,2 be non-negative, real-valued functions, each one nondecreasing in each of its argu¬ 

ments. Then 

E 


Lemma 6. (After 


l[G w (V 1 ,V 2 ,...,V k ) 

3 = 1 


Nelson and Matejcik 


> J] E[G w (V u V 2 ,...,V k )}. 

3 = 1 


1995 

and 

Nelson et al. 

2001 


200i . Lemma 1) For any G 2 C S, 


Stage 3 guarantees to select a system K £ G 2 such that Pr [maxjgG 2 hi ~ hK < S\ > 1 — a 2 . If, in 
addition, Stages 1 and 2 jointly guarantee that Pr[fe £ G 2 ] > 1 — a.\, then 


Pr [The procedure selects system K : — gx < S\ > 1 — a\ — a 2 . 
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Proof. Proof of Theorem |T] 


First, note that for any system i, it is well known (jCasella and Berger 


2002, page 218) that 


X t (ni) | S'f is normally distributed and Xu is independent of Sf for all l > n\. Furthermore, T) is 
obtained in Stage 0 independently of all X^s. Therefore, choosing the sampling rule based on T* 
and Sf does not affect the normality of the {Xi(rii(r)) : r = 0,1,..., r} sequence. 

For any two systems i and j, let KOij be the event that system i eliminates system j in 
Stages 1 or 2. It then follows that 


Pi[KOik in Stages 1 or 2] 

=E[Pv[KOik in Stages 1 or 2|Sf,S) 2 ]] 

<E[Pi[Y ki (r ki (r)) < - ai 3 (r ) for some r < r\S k ,Sf]] 

since system i could be eliminated by some other system before it can eliminate system k 
=E[Pr[Y ki (T ki (r)) < -a^(f) and T ki (r ) < r ki (r) for some r\S k ,Sf}] 

=E[Pi-[Z ki (t ki (r)) < aij{f) and t ki {r) < T ki {f) for some r|Sf,Sf]] 

=E[Pr[B flk _^.(t ki (r)) < - a^-(f) and t ki (r) < tkl ^\K k i(f) for some r\S k ,Sf]] by Lemnial 

T k i{r) r ki [r) 

^ElPrlB^-^ltkitr)) < and t ki (r) < t l3 (f) for some r\S k ,Sf]] by Lemmas [2] and [3] 

<Fi[Pr[S Atfc _ w (t) < -dij(r) for some t < tij(r)\S k , Sf]] 

<E[Pv[Bo(t) < —dij{r) for some t < hj(f)\S k , Sf]] since /i k > /ij 
dij(r) 


=E 


=E 


=E 


2d> 


2<F 




by Lemma 0] 


aij (r) 


(m - 1)5? (wi - l)Sl ] 

VBjWjP - 1) V * l a i ’ a k J, 


i mm 


2$ r]\ min 


(ni - 1 )Sf (ni - 1 )S% 


07 


07 


by choice of ajj(r) 


=1 — (1 — ai) fc - 1 by 0, since (ni — 1 )Sf/aj and (ni — l)S k /a k are i.i.d. x ni -i random variables. 
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Master Core Routine 

Input: List of systems S] Average Stage 2 batch size /3; 
Parameters 5,ai,a2,no,ni,f and a random 
number seed. 

begin Preparation: Setting up random number 
streams 

Initialize random number generator using the seed] 
foreach worker w = 1,2,..., c do 

Generate a new random number stream U w ', 
Send U w to w, 

end 

end 

begin Stage 0: Estimating simulation completion time 
{Gl : w = 1,2,..., c} «—Partition(<S, 0); 
foreach worker w = 1,2,... ,c do 
| Send G®j to Worker w, 

end 

Collect(Ti); 

end 

begin Stage 1: Estimating sample variances 
{G\j : w = 1, 2,..., c} •<—Partition(<S, 1); 
foreach worker w = 1,2,..., c do 
| Send G\j to Worker w, 

end 

Collect^? and Stat^o); 

{tSjGjy} <— RecvScreen(w;); 

end 


Worker Core Routine 

Input: List of systems <S; Parameters 5, ot\, 02 , no, n\. 

begin Preparation: Setting up random number 
streams 

Receive random number stream U w m , 

Initialize random number generator using Uw ; 

end 


begin Stage 0: Estimating simulation completion time 
Receive the set of systems to simulate, G®,; 
foreach system i E G®, do 
| Simulate^, no, simulation time T*); 

end 

Return {T* : i E G^} to master; 

end 

begin Stage 1: Estimating sample variances 
Receive the set of systems to simulate, G^,; 
foreach system i E G^, do 
| Simulate^, ni, (S?, Stat^o)); 

end 

Return {(5?, Static)) : i £ G w } to master; 

G w Screen(Gl,, 0, 0, false)] 

SendScreen(Gj ; ); 

end 


Figure A.2.1: Stages 0 and 1, MPI Implementation: Master (left) and workers (right) routines 


Then, noting that simulation results from different systems are mutually independent, we have 


Pr [system k E <Si] = E 


' k—l 


Pi^f)KO ik \X kl ,X k2 ,...'j 

k -1 

l[Pr{KO ik \X kl ,X k2 ,...} 

i=l 

> P E [Pr {KO ik \X k i,X k2 ,...}] by Lemma ED 


= E 

k -1 


i =1 
k -1 


k -1 


n Pr [Ko ik \ > n [1 ~ (i - a ~ ai )^ 


= 1 — Ot\. 


1=1 


i =1 


Finally, we invoke Lemma [6] to complete the proof. 


□ 


A.2 Full Description of the MPI implementation 


The purpose of this section is to provide additional insight into our parallel codes. In Figures [A. 2. II 
through IA.2.31 we demonstrate in greater detail how the master core allocates and distributes 
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begin Stage 2: Iterative screening 

G\ <— systems that survived Stage 1; 

{G™ : w = 1,2,..., c} •<—Partition(Gi, 2); 

S+-Gi; 

foreach worker w = 1, 2,..., c do 
Send Gi, G™ to Worker w; 
foreach system i E Gi do 
| Send Sf from Stage 1 to Worker it?; 

end 

foreach system i E G™ do 
| Send Statto worker it?; 

end 

end 

bi •*— BatchSize(i, /3), qi •*— 1 for all i E Gi; 
r“ nt i — 0, received «_ 0, greened <_ 0; flagw ( 
for all it? = 1 , 2,. .., c; 

while |<S| > 1 and r % Teened < f for some w do 
Wait for the next worker it? to call 
Communicate(); 
if flag w = 1 then 

/* Send screening task to worker it? 
*/ 

{i, qi , Stati )(?i } <— RecvOutput(it?); 
else if flag w = 2 then 

/* Send simulation task to worker w 
*/ 

(S, G ™, r^ reened } ^RecvScreen(id); 
{C,r™ ceived ,{Stat is> , r :r < 

r rece lv ed}} <_R ecv Best(«!); 

end 

if \S\ > 1 then 

rcurrent ^ C ountBatch(it?); 
if r current > r s u f nt then 

flag^ «— 2; SendAction(ic,flagtu); 
SendStats(lt?, r^ ent , r current). 

SendBestStats(it?); 

^.sent j _ current. 

1 W ' ' ’ 

else 

flag^ •*— 1; SendAction(it?, flag^); 
Select next i E S such that 
Qi = ^Global; 

SendSim(it?, i, qi, bi); 

Qi Qi + 1; 

if qi > q Global for all i E S then 
[ ^Global ^Global + R 

end 

end 

end 

end 

Send a termination instruction to all workers; 

end 


begin Stage 2: Iterative screening 

Receive the set of systems that survived, G\; 

Receive the set of systems to screen, G™; 
foreach System i E Gi do 
| Receive Sf collected in Stage 1; 
end 

foreach system i E G^ do 
| Receive Stat^o from the master; 
end 
r w •*— 0; 

Communicate (); 

while No termination instruction received do 

flag^ •<—RecvAction(); 
if flag w = 2 then 

{r new , {Stat i>r : i E G%,r w + 1 < r < r new }} 

<— RecvStats(); 

{W, { r re , ceived : w' E W}, 

{Stat i* /tr :w' E W, r < r ™ ceived }} 

<— RecvBestStats(); 

GV/ <— Screei^G?^, r w + 1, r new , true); 
r w 4— r new ; 

Communicate(); 

SendScreenfG^, r w ); SendBest(r); 

else 

{i,qi,bi} RecvSim(); 

Simulate(i, bi, Stat i, qi ); 

Communicate(); 

SendOutput(i, qi, Static) 

end 

for i = l to r do 

[ Simulate(i, ni, Xi) for one batch; 

end 

Screen among r sys. using current batch stats.; 
for each batch k up to the current one do 
If batch k stats, available, screen the r 
systems simulated, against the best 
systems from the other workers, at batch 
k; 

end 

if Master sends a terminate instruction then 
| continue-*— false; 

else if Master is ready to communicate then 
Report indexes of eliminated sys. to 
master; Report stats, of the best system 
for each batch simulated up to this point; 
Receive stats, of the best sys. from other 
workers; 

end 

end 

end 


Figure A.2.2: Stage 2, MPI Implementation: Master (left) and workers (right) routines 
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begin Stage 3: Rinott Stage 

G 2 <— systems that survived Stage 2; 
if IG 2 I = 1 then 

| Report the single surviving system as the best; 
else 

h 4— h( 1 — Q!2,ni, IG 2 I); 
foreach system i E G2 do 

Ni <r- max{«i(f), |"(/iSi/<5) 2 ]}; 

^ysent ^ _ q. pjreceived ^_ q. 

end 

f lag™ 4 — 0 for all w = 1,2,..., c; 

while N[ ecewed < Ni — rii(r) for some i E G2 

do 

Wait for the next worker w to call 
Communicate(); 
if flag w = 1 then 

Receive i, b^ and sample mean of the 
current batch; 

Merge sample mean into Xi; 

^received j _ yyreceived _i_ u/ 

end 

if N? ent < Ni — rii (r) for some i E G 2 

then 

Find an appropriate batch size 
b\ = min {bi, Ni - m(f) - Nt extsent } 
for system v, 

Send system i and b^ to worker w; 

at sent . yysent 1 U. 

x x 

flag?i)-<— 1; 

end 

end 

Report the system i* = argmaxi^G 2 Xi(Ni) 
as the best; 

end 

Send a termination instruction to all workers; 

end 


begin Stage 3: Rinott Stage 
Communicate(); 

while No termination instruction received do 

Receive a system i and batch size b'i from the 
master; 

Simulate system i for bi replications; 
Communicate(); 

Send i, b^ and sample mean of the b^ 
replications to the master; 

end 

end 


Figure A.2.3: Stage 3, MPI Implementation: Master (left) and workers (right) routines 
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systems, how random number streams are created and distributed together with the assigned 
systems to ensure independent sampling, and how simulation results are communicated between 
cores. 

We use the following notation for some subroutines in Figures I A. 2. II through [A. 2. 31 

Partition(5, Stage ) The master divides the set of systems S into disjoint partitions {G‘g tage : 
w = l,2,...,c}: 

In Stage 0, all systems are simulated for no replications to estimate simulation completion 
time. The master randomly permutes S (in case of long runtimes for some systems that are 
indexed closely) and assigns approximately equal numbers of systems to each G“. 

In Stage 1, a fixed number n\ of replications are required from each system. To balance the 
simulation work among workers, the master chooses Gf such that the estimated completion 
time YheCf n i^V n o is approximately equal for all w. 

In Stage 2, both simulation and screening are performed iteratively. Simulation of a system 
is no longer dedicated to a particular worker, and G™ is the set of systems that worker w 
needs to screen. To load-balance the screening work, the master assigns approximately equal 
numbers of systems to each G™■ 

Collect (info) The master collects info from all workers for all systems, in arbitrary order. 

Simulate (i, n, info) Worker w simulates system i for n replications and records info using the 
next subsubstreanr in U 

Stat; jr The batch statistics for the rth batch of system i. This includes sample size nj(r) and 
sample mean Xj(rij(r)) as described in fjHJ 

BatchSize(i, /3) The master calculates batch size bi system i used in Stage 2. Following the 
recommendation from 112.2.21 we let 


where (3 is a pre-determined average batch size. 



bi = 


SiVTi 


1ST Y.jes s iVTj 


P 


Screen(G"’, ro, ri, useothers ) Screen systems in G w from batches rg through r\ inclusive. It can 
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be checked that worker w has received Statj r for all i E 5, all r < ri and stored the data in 
its memory. 

A system i E G“ is eliminated if there exists system j E G?f : j ^ i and some r' : ro < r' < ?'i 
such that r' <r and Yij(r') < —a,ij(f) where Yjj and a Vj are defined in fj3j 

In addition, if useothers= true and W ^ I, then for each w' E W the worker also screens 
the systems in G w against system i*,, the best system from worker w', using batch statistics 
{Statj* r t : r' < r w /} up to batch min{r^/,ri}. 

SendScreen(G^, r w ) and RecvScreen(ui) Worker w sends r w and screening results (updated G w ) 
to the master, which then updates G w and S on its own memory accordingly. The master 
also receives r w and lets r ^ reened Tw _ 

Communicate() Worker sends a signal to master and waits for the master to receive the signal, 
before proceeding. 

SendSim (w,i,qi,bi) and RecvSimQ The master instructs worker w to simulate the g ? th batch of 
system i, for b{ replications. Worker w receives i, qi, bi from the master. 

SendOutput(i, qi, Statj^J and RecvOutput(u;) Worker w sends simulation output Statj j9i for the 
gjth batch of system i to the master. The master stores Statj i9i in memory. 

SendBest() and RecvBest(u;) Worker w sends its estimated-best system i* w (the one in G w with 
the highest batch mean) to the master, together with all batch statistics for system 
{Statj* : r < r w }; the master receives r w and lets r^ ceived <— r w . 

CountBatch(u;) The master finds the largest r current > r w such that Stat^ for all i E G w , 
r w < r < r current have been received by the master. 

SendAction(tc, f lagy,) and RecvAction() The master sends an indicator flag^ to worker w. where 
flag^ = 1 indicates “simulate a batch” and flag^ = 2 indicates “perform screening”. 

SendStats(u;) and RecvStatsQ The master sends Statj >r for all i E G w , r w < r < r current to 
worker w\ the worker receives r- current and lets r new 4— 7 - current ; the worker should have Statj r 
for all i E G w , 0 < r < r new upon completion. 
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SendBestStats(u>) and RecvBestStatsQ The master computes W = {w 1 7 ^ w : \G W '\ > 0)} and 
sends W to worker w; the master then sends all available batch statistics for best systems, 
{Statj* r : w' € W,r < r, r ®, ceived }, to worker re. 

A.3 Full Description of the Hadoop implementation 

We present in this section the full details of the MapReduce implementation of GSP. 

Each Mapper reads a comma-separated string of varied length, denoted by [value 1, value 2,..., $type], 
where the last component $type is used to indicate the specific information captured in the string. 

A Mapper usually runs some simulation, updates batch statistics, and generates one or more key: 
{value} pairs. All pairs under the same key are sent to the same Reducer, which is typically 
responsible for screening. A Reducer may generate one or more comma-separated strings which 
become the input to the Mapper in the next iteration. 

Each system i is coupled with stream; which is used by some random number generator and 
updated each time a random number is generated. The coupling of systems and streams ensures 
that the random numbers generated for each system in each iteration are all mutually independent. 

We also assume that each system i is preallocated to a particular screening group, as determined 
by the function Group (i). 

The procedure begins with Steps 1-3 which implements Stage 1, then enters Stage 2 where Steps 
4 and 5 are run repeatedly for a maximum of f iterations. If multiple systems survive Stage 2 , the 
procedure runs Steps 6 and 7 to finish Stage 3. 

Step 1. • Map: Estimate Sf 

Input [?'] 

Operation Initialize stream,; with seed i; Simulate system i for n\ replications to obtain 
X t (n 1 ) and Sf. 

Output i: Sf, stream,;, $S0} 

• Reduce 

Input i: {Aj(ni), Sf, stream;, $S0} 

Operation Calculate Yfi Si- 
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Output [i, Xi(n i), Sf, stream,, $S0] 


Step 2. • Map: Calculate batch size 

Input [*, Xi(n\) Sf, stream,- SSO] 

Operation Calculate batch size b t using bi = (3Si/(J2i Si/k). 

Output Group(i): {i, X t (n\), m, bi, Sf, stream,;, $Sim} 

• Reduce: Screen within a group 

Input Group: {*, Xj(rej), n t , bi, Sf, stream,, $Sim} for all i in the Group 
Operation Screen all systems in the Group and find the one i* with the highest mean. 
Output [*, Xi(m), ni, bi, Sf, stream,;, $Sim] for each surviving system i, and 
[**, Xi*{ni*), n^, bi*, Sf l, $Best] for the best system i* 

Step 3. • Map: Share best systems between groups 

Input (1) [i, Xi(rii), ni, bi, Sf, stream,;, $Sim] 

Operation (1) Simply output to Group(i). 

Output (1) Group(i): {i, Xi(m), ni, bi, Sf, stream,;, $Sim} 

Input (2) [i* , Xi*{rii*), n^, bi*, Sf l, $Best] 

Operation (2) Output to all groups. 

Output (2) Group: {z*, Xj* (n,*), rz,», bi*, Sf, , $Best} for every Group 

• Reduce: Screen against the best systems from other groups 

Input Group: {*, Xj(n,), n,, bi, Sf, stream,, $Sim} for all i in the Group, and 
Group: {?'*, X,* (rii*), n,*, bi*, Sf, , $Best} from every other Group 
Operation Screen all systems in Group against the best systems from other groups. 
Output [i, Xi{rii), ni, bi, Sf, stream,;, $Sim] for each surviving system i 

Step 4. • Map: Simulation 

Input [*, Xi(rii), ni, bi, Sf, stream,;, $Sim] 

Operation Simulate system i for additional bi replications, update rt,, X,(n,), and stream^. 
Output Group(z): {i, Xi{n\), n\, bi, Sf, stream,;, $Sim} 
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• Reduce: Screen within a group. 


(Same as Step 2 Reduce.) 

Step 5. Screen against best systems from other groups. 

(Same as Step 3.) 

Step 6. • Map: Determine Rinott sample sizes 

Input [i, Xi(ni), Hi, bi , Sf, stream,;, $Sim] 

Operation Output to Reducer. 

Output i: {Xi(rii), n^, Sf, stream,, $Sim} 

• Reduce 

Input i: {Xi(ni), Hi, Sf, stream,;, $Sim} 

Operation Calculate Rinott sample size and divide the additional sample into batches. 
For each batch j, generate a substream stream) using steam,;. 

Output [i, Xi(ni), rii, $S2], and 

for each batch j: [ i , stream), (size of batch j), $S3] 

Step 7. • Map: Simulate additional batches 

Input (1) [i, Xi(ni), ni, $S2] 

Operation (1) Output to Reducer, since this is the batch statistics generated in Stage 2. 
Output (1) 1: {i, Xi(rii), Hi, $S2} 

Input (2) \i, stream), (size of batch j), $S3] 

Operation (2) Simulate batch j of system i for the given batch size using stream), calculate 
batch sample mean Xj. 

Output (2) 1: {i, Xj, (size of batch j), $S3 } 

• Reduce: Merge batches and find the best system 

Input (This step has only one Reducer) 

1: {i, Xi(ni), Hi, $S2} and 

1: (size of batch j), $S3} for all system i and all batch j 
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Operation For each system i, merge all batches (including the one from Stage 2) to form 
a single sample mean. 

Output Report the system i* that has the highest sample mean. 
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